Splitting compounds with ngrams

Author

  • Naomi Tachikawa Shapiro
Abstract

Compound words with unmarked word boundaries are problematic for many tasks in NLP and computational linguistics, including information extraction, machine translation, and syllabification. This paper introduces a simple, proof-of-concept language modeling approach to automatic compound segmentation, demonstrated with Finnish. The approach utilizes an off-the-shelf morphological analyzer to split training words into their constituent morphemes. A language model is subsequently trained on ngrams composed of morphemes, morpheme boundaries, and word boundaries. Finally, linguistic constraints are used to weed out phonotactically ill-formed segmentations, thereby allowing the language model to select the best grammatical segmentation. This approach achieves an accuracy of ∼97%.
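
The pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the toy training words are pre-split by hand (standing in for the morphological analyzer's output), the symbols `=` and `#` are hypothetical markers for compound boundaries and word edges, the model is a bigram model with a small additive smoothing constant, and the only "phonotactic" constraint is that every part must contain a vowel.

```python
import math
from collections import Counter
from itertools import combinations

BOUNDARY = "="             # hypothetical marker for a compound-internal boundary
EDGE = "#"                 # hypothetical marker for the start/end of a word
VOWELS = set("aeiouyäö")   # Finnish vowel letters
ALPHA = 0.01               # additive smoothing constant (an assumption)

# Toy training words, already split into parts by hand.
# In the paper this role is played by a morphological analyzer.
TRAIN = [
    ["kirja", "kauppa"],
    ["kirja", "hylly"],
    ["kauppa", "katu"],
    ["katu", "valo"],
    ["kirja"],
    ["katu"],
]

def to_tokens(parts):
    """Wrap parts in word edges and join them with boundary symbols."""
    tokens = [EDGE]
    for i, part in enumerate(parts):
        if i > 0:
            tokens.append(BOUNDARY)
        tokens.append(part)
    tokens.append(EDGE)
    return tokens

def train_bigrams(words):
    """Count bigrams over parts, boundaries, and word edges."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for parts in words:
        tokens = to_tokens(parts)
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)

def log_prob(parts, model):
    """Smoothed bigram log-probability of one candidate segmentation."""
    bigrams, contexts, vocab_size = model
    tokens = to_tokens(parts)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = bigrams[(prev, cur)] + ALPHA
        den = contexts[prev] + ALPHA * vocab_size
        total += math.log(num / den)
    return total

def candidates(word):
    """Yield every way of cutting the word into contiguous parts."""
    positions = range(1, len(word))
    for k in range(len(word)):
        for cuts in combinations(positions, k):
            bounds = [0, *cuts, len(word)]
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def is_well_formed(parts):
    """Toy constraint: every part must contain at least one vowel."""
    return all(VOWELS & set(part) for part in parts)

def segment(word, model):
    """Drop ill-formed candidates, then let the model pick the best one."""
    legal = [c for c in candidates(word) if is_well_formed(c)]
    return max(legal, key=lambda c: log_prob(c, model))

MODEL = train_bigrams(TRAIN)
print(segment("kirjakatu", MODEL))
```

With this toy data, the unseen word "kirjakatu" should come out split as kirja + katu, because both parts and the boundary between them were observed in training, while "katu" on its own stays unsplit. Enumerating every possible cut is exponential in word length, so a real system would presumably restrict candidates to splits proposed by the analyzer.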

Similar resources

Accounting ngrams and multi-word terms can improve topic models

The paper presents an empirical study of integrating ngrams and multi-word terms into topic models, while maintaining similarities between them and words based on their component structure. First, we adapt the PLSA-SIM algorithm to the more widespread LDA model and ngrams. Then we propose a novel algorithm LDA-ITER that allows the incorporation of the most suitable ngrams into topic models. The...

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). It outputs the matched ngrams with their frequencies as well as all the context...

Occurrence Based Statistics in Machine Translation

As MT approaches demand longer context for better translation quality, the limitations of current language modeling techniques become explicit. The computational inability to model the likelihood of longer ngrams and the likelihood of their usage in probabilistic manner, have prevented us from exploring long ngrams in MT. In this paper, we propose and investigate a new set of features called oc...

Finding the Correct Interpretation of Swedish Compounds, a Statistical Approach

This paper treats compound splitting for Swedish, where compounding is productive and very common. A method for splitting compounds and several methods for choosing the correct interpretation of ambiguous compounds are presented. 99% of all compounds are split, 97% of these are correctly interpreted.

Effects of Location in the Tree Canopy on Some Quality Characteristics of Fresh Pistachio Fruit

Fresh pistachio fruit cv. Kalleghochi was harvested from the exterior and interior parts of the tree canopy in four geographical directions. The fruit position in exterior and interior parts of the tree canopy has a significant influence on the number of nuts per ounce, pistachio splitting, hull weight, shell weight, kernel weight, colour indices and total anthocyanin content. Results indicated...

Publication date: 2016